This project is a tutorial on applying multivariate time series analysis to the stock market index NIFTY 50 from the NSE (National Stock Exchange) of India. The data set contains the price history and trading volumes of fifty stocks in India from 2000-01-03 to 2020-09-30.
We illustrate how to use Python, R, and Stata to apply the Auto Regressive Integrated Moving Average (ARIMA) model to time series data. ARIMA can fit a non-seasonal, non-stationary time series based on its lagged values. A general ARIMA model consists of three parts: the “AR” part means the variable of interest is regressed on its own lagged terms, the “I” part means differenced values are used, and the “MA” part means the regression error is modelled as a linear combination of past error terms. The purpose of differencing is to make the time series stationary for the autoregression.
An ARIMA model is characterized by 3 terms: p (the order of the AR term), q (the order of the MA term), and d (the order of differencing needed to make the time series stationary). Given a time series \(\{X_t\}\), an \(ARIMA(p, d, q)\) model can be expressed as: \[(1-\sum_{i=1}^p\phi_iL^i)(1-L)^dX_t= (1+\sum_{i=1}^q\theta_iL^i)\epsilon_t + \delta\] where \(\epsilon_t\) is the error term, \(L\) is the lag operator, i.e. \(LX_t = X_{t-1}, \forall t>1\), \(p\) is the number of lagged terms of \(X\), \(d\) is the number of differencing steps needed for stationarity, \(q\) is the number of lagged forecast errors, \(\delta\) is the intercept term of the regression, and the \(\phi_i, \theta_i\)’s are the estimated regression coefficients.
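To make the notation concrete, here is a minimal Python sketch that simulates an \(ARIMA(1, 1, 1)\) process: the differenced series is generated as an ARMA(1, 1) and then cumulatively summed back. The coefficients (\(\phi_1=0.5\), \(\theta_1=0.3\), \(\delta=0\)) are hypothetical, chosen only for illustration.

```python
import numpy as np

# Simulate ARIMA(1, 1, 1): the differenced series Y_t = (1 - L) X_t
# follows an ARMA(1, 1): Y_t = phi * Y_{t-1} + eps_t + theta * eps_{t-1}.
# Coefficients are made up for illustration only.
rng = np.random.default_rng(42)
n, phi, theta = 500, 0.5, 0.3
eps = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t] + theta * eps[t - 1]
x = np.cumsum(y)  # undo the differencing: X_t = X_{t-1} + Y_t
```

Fitting an ARIMA(1, 1, 1) to `x` with any of the tools used below should approximately recover these coefficients.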
ARIMA models are fitted in order to understand the data better and to forecast future values. They are built on linear regression models. The best model can be chosen using information criteria such as AIC or BIC.
The NIFTY 50 data consist of 50 stocks, with 230104 observations on 15 variables. The data contain the daily open, close, highest, and lowest prices, volume, and other relevant information for the “Nifty Fifty” stocks since January 2000. Detailed variable descriptions are shown in Table 1 below.
| Variable Name | Variable Description |
|---|---|
| Date | Date of trade |
| Symbol | Name of the company |
| Series | We have only one series: Equity(EQ) |
| Prev Close | Previous day’s close price |
| Open | Open price of day |
| High | Highest price in day |
| Low | Lowest price in day |
| Last | Last traded price in day |
| Close | Close price of day |
| VWAP | Volume Weighted Average Price |
| Volume | Total number of shares traded during the day |
| Turnover | Total value of shares traded during the day |
| Trades | Number of trades executed during the day |
| Deliverable Volume | Shares which are actually transferred among demat accounts |
| %Deliverable | Percentage of deliverable volume |
| Return | Daily return, computed as Close minus Prev Close |
As we are mainly interested in stock prices, we mostly use the variable VWAP, which summarizes the average price of a stock over a trading day. We want to capture the trend of stock prices across the years and possibly forecast future prices.
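As a reminder of what VWAP measures: it is the total traded value divided by the total traded volume over the day. A toy Python calculation with made-up prices and volumes:

```python
import numpy as np

# VWAP = sum(price_i * volume_i) / sum(volume_i) over a trading day.
# Prices and volumes are made up for illustration.
prices = np.array([10.0, 20.0])
volumes = np.array([1.0, 3.0])
vwap = np.sum(prices * volumes) / np.sum(volumes)
print(vwap)  # 17.5 -- weighted toward the higher-volume trade
```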
Firstly, we import the data.
import sys
sys.path.append('/usr/local/lib/python3.8/site-packages')
# Just to make sure the packages are loaded.
import pandas as pd
import numpy as np
df = pd.read_csv('./NIFTY50_all.csv')
The variable Trades has about 50% missing values, so we delete it. We also drop redundant variables and impute %Deliverable via simple imputation, since it has relatively few missing values (about 7%). Finally, we need to merge the different symbols used for the same stock over time.
from datetime import datetime # Dealing with date format data
from sklearn.impute import SimpleImputer # Impute data
# Drop redundant variables and variables with too many missing values
df['Date'] = [datetime.strptime(x, '%Y-%m-%d') for x in df['Date']]
df1 = df.drop(['Trades', 'Deliverable Volume', 'Series'], axis=1)
ls1 = ['MUNDRAPORT', 'UTIBANK', 'BAJAUTOFIN', 'BHARTI', 'HEROHONDA',
'HINDALC0', 'HINDLEVER', 'INFOSYSTCH', 'JSWSTL', 'KOTAKMAH', 'TELCO',
'TISCO', 'UNIPHOS', 'SESAGOA', 'SSLT', 'ZEETELE']
ls2 = ['ADANIPORTS', 'AXISBANK', 'BAJFINANCE', 'BHARTIARTL', 'HEROMOTOCO',
'HINDALCO', 'HINDUNILVR', 'INFY', 'JSWSTEEL', 'KOTAKBANK', 'TATAMOTORS',
'TATASTEEL', 'UPL', 'VEDL', 'VEDL', 'ZEEL']
df1['Symbol'] = df1['Symbol'].replace(ls1, ls2)
df1['Symbol'] = pd.Categorical(df1['Symbol'])
# Impute missing values
df2 = pd.get_dummies(data=df1, drop_first=True)
df2['Date']=df2['Date'].map(datetime.toordinal)
imp = SimpleImputer()  # mean imputation by default
p = imp.fit_transform(df2)
df1['%Deliverble'] = p[:, 10]  # column 10 of df2 is %Deliverble (the raw data's spelling)
Before conducting the core analysis, let’s clean our data and check the basic data structure.
colnames(nifty)[colSums(is.na(nifty)) > 0]
## [1] "Trades" "Deliverable Volume" "%Deliverable"
As we can see, the variables Trades, Deliverable Volume, and %Deliverable have missing values, which we need to convert to 0. Besides, we found that some stocks changed their names during the 2000 to 2020 period, so we need to bring their names into accord.
library(dplyr)
library(ggplot2)
# change old stock names to new
old_name = c('MUNDRAPORT', 'UTIBANK', 'BAJAUTOFIN', 'BHARTI', 'HEROHONDA',
'HINDALC0', 'HINDLEVER', 'INFOSYSTCH', 'JSWSTL', 'KOTAKMAH', 'TELCO',
'TISCO', 'UNIPHOS', 'SESAGOA', 'SSLT', 'ZEETELE')
new_name = c('ADANIPORTS', 'AXISBANK', 'BAJFINANCE', 'BHARTIARTL', 'HEROMOTOCO',
'HINDALCO', 'HINDUNILVR', 'INFY', 'JSWSTEEL', 'KOTAKBANK', 'TATAMOTORS',
'TATASTEEL', 'UPL', 'VEDL', 'VEDL', 'ZEEL')
nifty$Symbol = plyr::mapvalues(nifty$Symbol, from = old_name, to = new_name)
# summary statistics of variable of interest
nifty_clean = nifty %>%
replace(is.na(.), 0) %>%
mutate(Return = Close - `Prev Close`) %>%
select(Date, Symbol, VWAP, Volume, Turnover, Return)
summary(nifty_clean)
## Date Symbol VWAP Volume
## Min. :2000-01-03 Length:230104 Min. : 9.21 Min. : 3
## 1st Qu.:2006-05-23 Class :character 1st Qu.: 272.43 1st Qu.: 209105
## Median :2011-06-07 Mode :character Median : 554.00 Median : 973757
## Mean :2011-02-20 Mean : 1224.17 Mean : 2745270
## 3rd Qu.:2016-02-05 3rd Qu.: 1213.47 3rd Qu.: 2836929
## Max. :2020-09-30 Max. :32975.24 Max. :481058927
## Turnover Return
## Min. :1.047e+07 Min. :-19525.95
## 1st Qu.:1.520e+13 1st Qu.: -5.80
## Median :6.400e+13 Median : 0.05
## Mean :1.433e+14 Mean : 0.25
## 3rd Qu.:1.710e+14 3rd Qu.: 6.40
## Max. :3.560e+16 Max. : 2107.70
First, we import the data and do some data cleaning. We drop the variables Series, Trades, and Deliverable Volume. Also, we transform the date variable from string type to date type so that the whole data set can be treated as time series data.
import delimited NIFTY50_all.csv, clear
* Data Cleaning
gen date2 = date(date, "YMD")
format date2 %tdCCYY-nn-dd
drop date series
drop trades deliverablevolume
rename date2 date
label variable date "Date"
* Replace Symbol Names
replace symbol = "ADANIPORTS" if symbol == "MUNDRAPORT"
replace symbol = "AXISBANK" if symbol == "UTIBANK"
replace symbol = "BAJFINANCE" if symbol == "BAJAUTOFIN"
replace symbol = "BHARTIARTL" if symbol == "BHARTI"
replace symbol = "HEROMOTOCO" if symbol == "HEROHONDA"
replace symbol = "HINDALCO" if symbol == "HINDALC0"
replace symbol = "HINDUNILVR" if symbol == "HINDLEVER"
replace symbol = "INFY" if symbol == "INFOSYSTCH"
replace symbol = "JSWSTEEL" if symbol == "JSWSTL"
replace symbol = "KOTAKBANK" if symbol == "KOTAKMAH"
replace symbol = "TATAMOTORS" if symbol == "TELCO"
replace symbol = "TATASTEEL" if symbol == "TISCO"
replace symbol = "UPL" if symbol == "UNIPHOS"
replace symbol = "VEDL" if symbol == "SESAGOA"
replace symbol = "VEDL" if symbol == "SSLT"
replace symbol = "ZEEL" if symbol == "ZEETELE"
* Save the cleaned data
save NIFTY_clean, replace
The main packages used in the Python tutorial are:

- pandas: For data cleaning and data frame mutations
- numpy: For numeric data processing
- datetime: For date format data processing
- sklearn: For missing data imputation
- statsmodels and pmdarima: For the main ARIMA models
- matplotlib: For making plots

We can take a look at the data. Take the stock “ADANIPORTS” as an example.
import matplotlib.pyplot as plt # Plotting package in python
names = df1['Symbol'].cat.categories
example = df1[df1['Symbol'] == names[0]]
fig, ax = plt.subplots(3, 1, figsize=(8, 8))
ax[0].plot(example['Date'], example['VWAP'])
ax[0].set_xticks([])
ax[0].set_xlabel('Days')
ax[0].set_ylabel('Volume weighted average price')
ax[1].plot(example['Date'], example['Volume'])
ax[1].set_xticks([])
ax[1].set_xlabel('Days')
ax[1].set_ylabel('Volume')
ax[2].plot(example['Date'], example['Turnover'])
ax[2].set_xticks([])
ax[2].set_xlabel('Days')
ax[2].set_ylabel('Turnover')
ax[0].set_title('Time series plots of stock %s' % names[0])
The Python tutorial will use the time series VWAP for the analysis below.
The differencing parameter \(d\) of the model can be determined by Augmented Dickey-Fuller (ADF) tests, which indicate whether a time series is stationary. See the references for more details about ADF tests. The Python packages statsmodels.tsa and pmdarima.arima are very helpful here.
from statsmodels.tsa.arima_model import ARIMA  # note: removed in statsmodels >= 0.13; newer code uses statsmodels.tsa.arima.model.ARIMA
from pmdarima.arima.utils import ndiffs # ARIMA model packages
# This function chooses the smallest d for the series to be stationary
names = df1['Symbol'].cat.categories
ls0 = []
for i in names:
    # Select the rows for stock i
    subdf = df1[df1['Symbol'] == i]
    ls0.append(ndiffs(subdf['VWAP'], test='adf'))
ls0 # Most values are 1
## [0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
max(ls0) # We don't need 2-order differencing
## 1
In order to ensure all time series are stationary, we choose \(d=1\) for all stocks.
The AR parameter \(p\) of the model can be determined by looking at Partial Autocorrelation plots. These plots indicate the correlation between the series and its lag. We use the first 4 stocks alphabetically as a sample. Notice the series need to be differenced first (\(d=1\)).
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, ax = plt.subplots(1, 4, figsize=(15, 5))
# Take the first 4 stocks as a sample
for i in range(4):
    # Select the rows for stock number i
    subdf = df1[df1['Symbol'] == names[i]]
    plot_pacf(subdf['VWAP'].diff().dropna(), ax=ax[i])
    ax[i].set_title('PACF plot for stock %s' % names[i])
Notice lag 1 is at least borderline significant in the plots for all stocks, but lag 2 is not significant. Therefore, we can choose \(p=1\) for all stocks.
The MA parameter \(q\) of the model can be determined by looking at Autocorrelation (ACF) plots. We use the next 4 stocks alphabetically as a sample. Similarly, the series need to be differenced (\(d=1\)).
fig, ax = plt.subplots(1, 4, figsize=(15, 5))
for i in range(4):
    subdf = df1[df1['Symbol'] == names[i+4]]
    plot_acf(subdf['VWAP'].diff().dropna(), ax=ax[i])
    ax[i].set_title('ACF plot for stock %s' % names[i+4])
Notice lag 1 is significant for most of the stocks but lag 2 is not. Therefore we can choose \(q=1\) for the MA term.
According to the process above, we choose the \(ARIMA(1, 1, 1)\) for all stocks. After fitting the models, we can view some of the summaries. Take the first 5 stocks alphabetically as an example.
mlist = [] # Models
flist = [] # Model fits
for i in names:
    subdf = df1[df1['Symbol'] == i]
    m = ARIMA(list(subdf['VWAP']), order=(1, 1, 1))
    mlist.append(m)
    flist.append(m.fit(disp=0))
for i in range(5):
    print(flist[i].summary())
## ARIMA Model Results
## ==============================================================================
## Dep. Variable: D.y No. Observations: 3178
## Model: ARIMA(1, 1, 1) Log Likelihood -13340.537
## Method: css-mle S.D. of innovations 16.100
## Date: Tue, 24 Nov 2020 AIC 26689.074
## Time: 11:39:50 BIC 26713.330
## Sample: 1 HQIC 26697.773
##
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## const -0.2048 0.314 -0.652 0.514 -0.820 0.411
## ar.L1.D.y 0.1049 0.151 0.694 0.488 -0.192 0.401
## ma.L1.D.y -0.0156 0.152 -0.103 0.918 -0.313 0.282
## Roots
## =============================================================================
## Real Imaginary Modulus Frequency
## -----------------------------------------------------------------------------
## AR.1 9.5325 +0.0000j 9.5325 0.0000
## MA.1 64.1530 +0.0000j 64.1530 0.0000
## -----------------------------------------------------------------------------
## ARIMA Model Results
## ==============================================================================
## Dep. Variable: D.y No. Observations: 5162
## Model: ARIMA(1, 1, 1) Log Likelihood -29071.318
## Method: css-mle S.D. of innovations 67.549
## Date: Tue, 24 Nov 2020 AIC 58150.637
## Time: 11:39:50 BIC 58176.833
## Sample: 1 HQIC 58159.803
##
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 0.3108 0.960 0.324 0.746 -1.570 2.192
## ar.L1.D.y -0.1743 0.368 -0.473 0.636 -0.896 0.547
## ma.L1.D.y 0.1985 0.366 0.542 0.588 -0.519 0.916
## Roots
## =============================================================================
## Real Imaginary Modulus Frequency
## -----------------------------------------------------------------------------
## AR.1 -5.7387 +0.0000j 5.7387 0.5000
## MA.1 -5.0380 +0.0000j 5.0380 0.5000
## -----------------------------------------------------------------------------
## ARIMA Model Results
## ==============================================================================
## Dep. Variable: D.y No. Observations: 5162
## Model: ARIMA(1, 1, 1) Log Likelihood -24320.029
## Method: css-mle S.D. of innovations 26.908
## Date: Tue, 24 Nov 2020 AIC 48648.059
## Time: 11:39:50 BIC 48674.255
## Sample: 1 HQIC 48657.225
##
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 0.0766 0.399 0.192 0.848 -0.706 0.859
## ar.L1.D.y -0.0007 0.237 -0.003 0.998 -0.465 0.464
## ma.L1.D.y 0.0665 0.237 0.281 0.779 -0.397 0.530
## Roots
## =============================================================================
## Real Imaginary Modulus Frequency
## -----------------------------------------------------------------------------
## AR.1 -1485.7047 +0.0000j 1485.7047 0.5000
## MA.1 -15.0293 +0.0000j 15.0293 0.5000
## -----------------------------------------------------------------------------
## ARIMA Model Results
## ==============================================================================
## Dep. Variable: D.y No. Observations: 3058
## Model: ARIMA(1, 1, 1) Log Likelihood -15798.297
## Method: css-mle S.D. of innovations 42.406
## Date: Tue, 24 Nov 2020 AIC 31604.593
## Time: 11:39:50 BIC 31628.695
## Sample: 1 HQIC 31613.254
##
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 0.7437 0.831 0.895 0.371 -0.885 2.372
## ar.L1.D.y -0.2225 0.127 -1.758 0.079 -0.470 0.026
## ma.L1.D.y 0.3245 0.122 2.654 0.008 0.085 0.564
## Roots
## =============================================================================
## Real Imaginary Modulus Frequency
## -----------------------------------------------------------------------------
## AR.1 -4.4950 +0.0000j 4.4950 0.5000
## MA.1 -3.0812 +0.0000j 3.0812 0.5000
## -----------------------------------------------------------------------------
## ARIMA Model Results
## ==============================================================================
## Dep. Variable: D.y No. Observations: 3057
## Model: ARIMA(1, 1, 1) Log Likelihood -17217.768
## Method: css-mle S.D. of innovations 67.579
## Date: Tue, 24 Nov 2020 AIC 34443.535
## Time: 11:39:50 BIC 34467.636
## Sample: 1 HQIC 34452.196
##
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 1.7395 1.497 1.162 0.245 -1.195 4.674
## ar.L1.D.y -0.1671 0.071 -2.350 0.019 -0.307 -0.028
## ma.L1.D.y 0.4298 0.066 6.537 0.000 0.301 0.559
## Roots
## =============================================================================
## Real Imaginary Modulus Frequency
## -----------------------------------------------------------------------------
## AR.1 -5.9832 +0.0000j 5.9832 0.5000
## MA.1 -2.3266 +0.0000j 2.3266 0.5000
## -----------------------------------------------------------------------------
Notice that for the first 3 stocks, the model fit is not good. We can consider removing the MA term since it’s non-significant for some stocks, i.e. choosing the \(ARIMA(1, 1, 0)\) model.
mlist0 = [] # Models
flist0 = [] # Model fits
for i in names:
    subdf = df1[df1['Symbol'] == i]
    m = ARIMA(list(subdf['VWAP']), order=(1, 1, 0))
    mlist0.append(m)
    flist0.append(m.fit(disp=0))
We use AIC as the model selection criterion. The AIC decreases for the first three stocks and increases for the 4th and 5th, indicating that different stocks need different models.
includema = [] # Whether MA term should be included
for i in range(50):
    includema.append(flist0[i].aic > flist[i].aic)
pd.value_counts(includema)
## True 28
## False 22
## dtype: int64
Firstly, we can use the in-sample lagged values to predict the time series. We can plot the prediction results for the first 5 stocks.
fig, ax = plt.subplots(5, figsize=(15, 20))
for i in range(5):
    if includema[i]:  # ARIMA(1,1,1) has the lower AIC for this stock
        flist[i].plot_predict(dynamic=False, ax=ax[i])
    else:  # ARIMA(1,1,0)
        flist0[i].plot_predict(dynamic=False, ax=ax[i])
    ax[i].set_title(names[i])
fig.tight_layout()
fig
We can also forecast future VWAPs using the chosen models. For example, we can forecast the average prices in the next 200 trading days after the time series.
fig, ax = plt.subplots(5, figsize=(15, 20))
for i in range(5):
    if includema[i]:
        forecast, b, ci = flist[i].forecast(200, alpha=0.05)
    else:
        forecast, b, ci = flist0[i].forecast(200, alpha=0.05)
    subdf = df1[df1['Symbol'] == names[i]]
    ax[i].plot(list(subdf['VWAP']))
    idx = range(len(subdf['VWAP']), 200 + len(subdf['VWAP']))
    ax[i].plot(idx, forecast)
    ax[i].fill_between(idx, ci[:, 0], ci[:, 1], alpha=0.15)
    ax[i].set_title(names[i])
    ax[i].set_xticks([])
fig.tight_layout()
fig
fig.tight_layout()
fig
Notice the confidence intervals are very wide, indicating it’s not easy to forecast stock prices.
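This widening is expected. For the simplest differenced model, a random walk (\(ARIMA(0, 1, 0)\)), the variance of the \(h\)-step-ahead forecast is \(h\sigma^2\), so the 95% interval half-width grows like \(\sqrt{h}\); a quick illustration in Python:

```python
import math

# For ARIMA(0, 1, 0) (a random walk), Var[X_{t+h} | X_t] = h * sigma^2,
# so a 95% interval half-width is 1.96 * sigma * sqrt(h).
sigma = 1.0
for h in (1, 50, 200):
    print(h, round(1.96 * sigma * math.sqrt(h), 2))
```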
Since different stocks favor different models, we can further improve the fit by choosing the most appropriate model for each stock individually.
We can use auto_arima to choose models: it compares candidate models and selects the best one. Again, we use the ADF test to determine \(d\), and AIC to determine \(p\) and \(q\). Take “ADANIPORTS”, “ASIANPAINT”, and “BPCL” as examples.
import pmdarima as paim
subdf = df1[df1['Symbol'] == names[0]]
m1 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
max_p=3, max_q=3)
m1.summary() # The chosen model was ARIMA(1, 0, 1), which is a good fit.
## <class 'statsmodels.iolib.summary.Summary'>
## """
## SARIMAX Results
## ==============================================================================
## Dep. Variable: y No. Observations: 3179
## Model: SARIMAX(1, 0, 1) Log Likelihood -13346.144
## Date: Tue, 24 Nov 2020 AIC 26700.287
## Time: 11:40:08 BIC 26724.545
## Sample: 0 HQIC 26708.987
## - 3179
## Covariance Type: opg
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## intercept 1.1366 1.144 0.993 0.321 -1.106 3.379
## ar.L1 0.9971 0.002 572.414 0.000 0.994 1.001
## ma.L1 0.0892 0.005 16.587 0.000 0.079 0.100
## sigma2 259.0152 0.418 619.397 0.000 258.196 259.835
## ===================================================================================
## Ljung-Box (Q): 52.63 Jarque-Bera (JB): 120511061.15
## Prob(Q): 0.09 Prob(JB): 0.00
## Heteroskedasticity (H): 0.06 Skew: -22.97
## Prob(H) (two-sided): 0.00 Kurtosis: 955.73
## ===================================================================================
##
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """
ls0[0] # Indeed, differencing is not needed for the stock 'ADANIPORTS'.
## 0
subdf = df1[df1['Symbol'] == names[1]]
m2 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
max_p=3, max_q=3)
m2.summary() # The chosen model was ARIMA(0, 1, 0), which is a good fit.
# For 'ASIANPAINT', the 1-order difference series is close to a constant.
## <class 'statsmodels.iolib.summary.Summary'>
## """
## SARIMAX Results
## ==============================================================================
## Dep. Variable: y No. Observations: 5163
## Model: SARIMAX(0, 1, 0) Log Likelihood -29072.936
## Date: Tue, 24 Nov 2020 AIC 58147.873
## Time: 11:40:10 BIC 58154.422
## Sample: 0 HQIC 58150.164
## - 5163
## Covariance Type: opg
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## sigma2 4564.8605 1.982 2303.040 0.000 4560.976 4568.745
## ===================================================================================
## Ljung-Box (Q): 42.98 Jarque-Bera (JB): 3632346684.27
## Prob(Q): 0.34 Prob(JB): 0.00
## Heteroskedasticity (H): 4.65 Skew: -60.55
## Prob(H) (two-sided): 0.00 Kurtosis: 4110.73
## ===================================================================================
##
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """
subdf = df1[df1['Symbol'] == names[7]]
m3 = paim.auto_arima(subdf['VWAP'], start_p=1, start_q=1, test='adf',
max_p=3, max_q=3, error_action='ignore')
m3.summary() # The chosen model was ARIMA(1, 1, 2), which is a good fit.
# For 'BPCL' second order MA term is needed.
## <class 'statsmodels.iolib.summary.Summary'>
## """
## SARIMAX Results
## ==============================================================================
## Dep. Variable: y No. Observations: 5163
## Model: SARIMAX(1, 1, 2) Log Likelihood -21047.069
## Date: Tue, 24 Nov 2020 AIC 42102.139
## Time: 11:40:33 BIC 42128.335
## Sample: 0 HQIC 42111.305
## - 5163
## Covariance Type: opg
## ==============================================================================
## coef std err z P>|z| [0.025 0.975]
## ------------------------------------------------------------------------------
## ar.L1 0.9703 0.019 51.948 0.000 0.934 1.007
## ma.L1 -0.9021 0.020 -45.739 0.000 -0.941 -0.863
## ma.L2 -0.0779 0.008 -10.104 0.000 -0.093 -0.063
## sigma2 203.7130 0.511 398.595 0.000 202.711 204.715
## ===================================================================================
## Ljung-Box (Q): 31.14 Jarque-Bera (JB): 97447842.12
## Prob(Q): 0.84 Prob(JB): 0.00
## Heteroskedasticity (H): 5.01 Skew: -18.28
## Prob(H) (two-sided): 0.00 Kurtosis: 675.11
## ===================================================================================
##
## Warnings:
## [1] Covariance matrix calculated using the outer product of gradients (complex-step).
## """
We can improve each of the models individually for slightly better forecast performance.
The main packages used in the R tutorial are:

- dplyr: For data manipulation
- ggplot2: For plot construction
- gridExtra: For plot alignment
- forecast: For fitting ARIMA models and forecasting

# plot trend of all stocks
fig.cap1 = "**Figure 1.** *Daily trend of all stocks, 2000-2020.*"
nifty_ts = reshape2::melt(nifty_clean[, -c(2)], id.vars = "Date")
ggplot(nifty_ts, aes(x = Date, y = value)) +
geom_line(color= "deepskyblue4") +
theme_bw() +
facet_wrap(~ variable, scales = "free_y", ncol = 1)
Figure 1. Daily trend of all stocks, 2000-2020.
From the trend of all stocks, we can see that the time series exhibit non-stationarity. There was a substantial shock to the Indian stock market after the outbreak of the Coronavirus.
The R tutorial will focus on the variables Symbol, VWAP, Volume, Turnover, and a newly created variable Return, which is the difference between Close and Prev Close. First, we choose one stock, “ADANIPORTS”, and analyze the ACF/PACF of the above four variables. Normally, the choice of p and q in ARIMA(p, d, q) depends on ACF/PACF plots. The trend plot above shows huge volatility in VWAP, Volume, and Turnover, so we take a log transformation to reduce it. The functions for generating ACF/PACF plots are ggAcf() and ggPacf(), both from the forecast package. You can also use plot.acf() via the S3 method.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
candidate = "ADANIPORTS"
vars_list = c("VWAP", "Volume", "Turnover", "Return")
nifty_cand = nifty_clean %>%
filter(Symbol == candidate) %>%
mutate_at(vars(matches(c("VWAP", "Volume", "Turnover"))), log)
# plot ACF
fig.cap2 = "**Figure 2.** *ACF plots for stock: ADANIPORTS.*"
acf_list = vector(mode = "list", length = length(vars_list))
names(acf_list) = vars_list
for ( var in vars_list ) {
acf_list[[var]] = forecast::ggAcf(nifty_cand[[var]], lag.max = 60) +
ggtitle(var) +
theme_bw()
}
do.call("grid.arrange", c(acf_list, ncol = 2))
Figure 2. ACF plots for stock: ADANIPORTS.
Notice that all the variables show high autocorrelation except Return, because Return is computed as the first difference of the closing price, which acts as a linear filter that eliminates the trend. Since we are going to apply the ARIMA model, which only works for stationary time series, let’s take the first difference of the other three variables and compare the autocorrelation plots to the previous ones. Later, we will apply the auto.arima() function in R, which handles non-stationary time series by applying the appropriate number of differences to detrend the data.
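(The R chunk that constructs acf_diff_list is not shown here.) The effect of first differencing can be sketched in Python: a simulated random walk has lag-1 autocorrelation near 1, while its first difference is white noise with lag-1 autocorrelation near 0.

```python
import numpy as np

# First differencing removes a stochastic trend: a random walk has
# lag-1 autocorrelation near 1, while its first difference (white
# noise) has lag-1 autocorrelation near 0.
rng = np.random.default_rng(1)
walk = np.cumsum(rng.standard_normal(2000))

def lag1_autocorr(x):
    return np.corrcoef(x[:-1], x[1:])[0, 1]

print(lag1_autocorr(walk))           # close to 1
print(lag1_autocorr(np.diff(walk)))  # close to 0
```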
fig.cap3 = paste("**Figure 3.** *ACF plots for stock: ADANIPORTS.*",
"VWAP, Volume, and Turnover have taken first difference.")
do.call("grid.arrange", c(acf_diff_list, ncol = 2))
Figure 3. ACF plots for stock: ADANIPORTS. VWAP, Volume, and Turnover have taken first difference.
fig.cap4 = paste("**Figure 4.** *PACF plots for stock: ADANIPORTS.*",
"VWAP, Volume, and Turnover have taken first difference.")
do.call("grid.arrange", c(pacf_list, ncol = 2))
Figure 4. PACF plots for stock: ADANIPORTS. VWAP, Volume, and Turnover have taken first difference.
By looking at the ACF/PACF plots, you can get a general idea of which p and q values to choose. Take VWAP for example: both plots show that the significant ACF and PACF values end at the second lag, suggesting that ARIMA(2, 1, 2) might be suitable for VWAP. However, eyeballing the plots is not precise, and for a variable like Return it is tricky to find p and q from the ACF/PACF alone. As a result, we can use AIC as the criterion to choose p and q.
Let’s tabulate AIC values for a range of choices of p and q, with d fixed at 0 for Return and at 1 for the other three variables. We hold out the last 30 observations as test data. Below is the AIC table from fitting ARIMA to the Return series of the stock “ADANIPORTS”.
aic_table = function(ts, P, Q, d){
  table = matrix(NA, (P + 1), (Q + 1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p + 1, q + 1] <- arima(ts, order = c(p, d, q))$aic
    }
  }
  dimnames(table) = list(paste("AR", 0:P, sep = ""),
                         paste("MA", 0:Q, sep = ""))
  table
}
# Construct AIC table
nifty_cand_ts = ts(nifty_cand$Return, frequency = 1, start = c(2000, 01, 03))
nifty_aic_table = aic_table(head(nifty_cand_ts, -30), 4, 4, 0)
## Warning in arima(ts, order = c(p, d, q)): possible convergence problem: optim
## gave code = 1
## Warning in arima(ts, order = c(p, d, q)): possible convergence problem: optim
## gave code = 1
## Warning in arima(ts, order = c(p, d, q)): possible convergence problem: optim
## gave code = 1
## Warning in arima(ts, order = c(p, d, q)): possible convergence problem: optim
## gave code = 1
## Warning in arima(ts, order = c(p, d, q)): possible convergence problem: optim
## gave code = 1
tab.cap2 = '**Table 2**. *AIC for different ARIMA parameters*'
nifty_aic_table %>%
knitr::kable(format = 'html', caption = tab.cap2) %>%
kableExtra::kable_styling('striped', full_width = TRUE)
| MA0 | MA1 | MA2 | MA3 | MA4 | |
|---|---|---|---|---|---|
| AR0 | 27514.98 | 27516.98 | 27518.81 | 27518.31 | 27518.78 |
| AR1 | 27516.98 | 27518.98 | 27520.96 | 27518.11 | 27519.28 |
| AR2 | 27518.83 | 27520.66 | 27514.92 | 27518.32 | 27477.74 |
| AR3 | 27517.96 | 27517.47 | 27478.51 | 27518.30 | 27520.65 |
| AR4 | 27518.41 | 27519.87 | 27518.41 | 27472.75 | 27514.69 |
The AIC table suggests that ARIMA(4, 0, 3), with the smallest AIC, is the best model for the return of “ADANIPORTS”. The table may suggest that increasing p and q tends to yield a smaller AIC and a better fit. However, models with higher p and q are more complex, which can lead to problems such as overfitting and numerical instability. We usually prefer a simpler model, which is also easier to interpret.
Even though it is nice to view how the AIC changes with p and q, for a big data set like this it is very inefficient to iterate over a grid of p and q. auto.arima() in the forecast package generates results much faster. It uses a variant of the Hyndman-Khandakar algorithm, which combines unit root tests, minimization of the AICc, and maximum likelihood estimation. We run auto.arima() on the training data set; the selected order is (4, 0, 2).
ts_arima = auto.arima(head(nifty_cand_ts, -30), max.p = 4,
max.q = 4, max.d = 3)
print(ts_arima)
## Series: head(nifty_cand_ts, -30)
## ARIMA(4,0,2) with zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ma1 ma2
## 1.4754 -0.7572 0.0017 0.0138 -1.4765 0.7656
## s.e. 0.2919 0.1707 0.0413 0.0362 0.2924 0.1669
##
## sigma^2 estimated as 364.2: log likelihood=-13751.28
## AIC=27516.55 AICc=27516.59 BIC=27558.94
The return equation can be written as: \[X_t = 1.475 X_{t-1}-0.757X_{t-2}+0.002X_{t-3}+0.014X_{t-4}+\varepsilon_t -1.477\varepsilon_{t-1}+0.766\varepsilon_{t-2}\]
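As a cross-check on the information criteria, AIC and BIC can be recomputed from the log-likelihood reported above; a short Python sketch (k = 7 counts the six ARMA coefficients plus \(\sigma^2\); n = 3149 is an assumption, the 3179 ADANIPORTS observations minus the 30 held out):

```python
import math

# Values copied from the auto.arima() printout above.
loglik = -13751.28
k = 7          # ar1..ar4, ma1, ma2, plus sigma^2
n = 3149       # assumed training length: 3179 observations minus 30 held out

aic = 2 * k - 2 * loglik          # AIC = 2k - 2 log L
bic = k * math.log(n) - 2 * loglik  # BIC = k log n - 2 log L
print(round(aic, 2), round(bic, 2))  # close to the reported AIC=27516.55, BIC=27558.94
```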
Lastly, we will test our model by forecasting the next 30 observations and comparing the result with our test set.
ts_forecasts = forecast(ts_arima, h = 30)
acc = accuracy(ts_forecasts, head(tail(nifty_cand_ts, 30), 7))
print(round(acc, 4))
## ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
## Training set -0.0247 19.0664 6.8923 NaN Inf 0.7046 -0.0016 NA
## Test set 0.4030 5.6660 4.8247 96.1783 96.1783 0.4933 -0.3315 0.8505
The RMSE and MAE for the test set are 5.6660 and 4.8247, respectively. Furthermore, we can plot the residual diagnostics of our forecast.
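For reference, the headline accuracy metrics are straightforward to compute by hand; a small Python sketch with made-up actual and forecast values:

```python
import numpy as np

# RMSE = root mean squared error, MAE = mean absolute error.
# Toy numbers, for illustration only.
actual = np.array([1.0, 2.0, 3.0, 4.0])
forecast = np.array([1.5, 2.0, 2.0, 5.0])
err = actual - forecast
rmse = np.sqrt(np.mean(err ** 2))
mae = np.mean(np.abs(err))
print(rmse, mae)  # 0.75 0.625
```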
fig.cap5 = "**Figure 5.** *Residual Diagnosis*"
p1 = autoplot(ts_forecasts, main = "") + xlab("Day") +
ggtitle("Residuals of Forecast") +
ylab("Return") +
theme_bw()
p2 = ggAcf(resid(ts_arima)) + ggtitle("ACF of residuals") +
theme_bw()
grid.arrange(p1, p2, ncol = 2)
Figure 5. Residual Diagnosis
Unfortunately, the residuals do not appear normal; their distribution is heavy-tailed. Since the ARIMA model we applied takes the MLE approach with Gaussian assumptions, our data clearly violate them. The ACF of the residuals indicates that there is correlation left in the residual series, so our model fails to account for all available information.
One way to improve the model is to log-transform the data. A second way is to fit an ARIMA model with t-distributed errors instead of assuming Gaussian white noise. A third way is to use data segmentation that takes interventions into consideration, as stock data is often affected by government policy.
Let’s try to take log transformation for Close and Prev Close prior to calculating the Return.
nifty_cand = nifty %>%
filter(Symbol == candidate) %>%
mutate_at(vars(matches(c("Close", "Prev Close"))), log) %>%
mutate(Return = Close - `Prev Close`)
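The same transformation can be sketched in plain Python: since the difference of logs equals the log of the ratio, the log return is log(Close / Prev Close). The function name below is our own, for illustration:

```python
import math

def log_return(close, prev_close):
    """Log return: log(Close) - log(Prev Close) = log(Close / Prev Close)."""
    return math.log(close) - math.log(prev_close)

# Toy prices: a 2% rise gives a log return of about 0.0198.
print(log_return(102.0, 100.0))
```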
nifty_cand_ts = ts(nifty_cand$Return, frequency = 1, start = c(2000, 01, 03))
ts_log_arima = auto.arima(head(nifty_cand_ts, -30), max.p = 4,
max.q = 4, max.d = 3)
print(ts_log_arima)
## Series: head(nifty_cand_ts, -30)
## ARIMA(1,0,1) with zero mean
##
## Coefficients:
## ar1 ma1
## -0.0951 0.0900
## s.e. 2.7882 2.7899
##
## sigma^2 estimated as 0.001757: log likelihood=5521.64
## AIC=-11037.29 AICc=-11037.28 BIC=-11019.12
auto.arima() suggests ARIMA(1, 0, 1) is the best fit for log returns.
fig.cap6 = "**Figure 6.** *Residual Diagnosis of log returns.*"
p1 = autoplot(resid(ts_log_arima)) + ggtitle("Residuals of log returns") + theme_bw()
p2 = ggAcf(resid(ts_log_arima)) + ggtitle("ACF of residuals") + theme_bw()
grid.arrange(p1, p2, ncol = 2)
Figure 6. Residual Diagnosis of log returns.
We can see that after taking the log transformation, there appears to be no significant correlation in the residual series, and the variation of the residuals stays much the same apart from two outliers. Consequently, we can now be more confident about the model forecasts, which appear to account for all available information, although prediction intervals computed under a normal distribution may still be inaccurate.
First, we visualize the data, taking the stock “ADANIPORTS” as an example.
use NIFTY_clean, clear
keep if symbol == "ADANIPORTS"
graph twoway line vwap date, lcolor("blue") xtitle("Days") ///
ytitle("Volume weighted average price")
graph export vwap_data.png, replace
graph twoway line volume date, lcolor("blue") xtitle("Days") ytitle("Volume")
graph export volume_data.png, replace
graph twoway line turnover date, lcolor("blue") xtitle("Days") ///
ytitle("Turnover")
graph export turnover_data.png, replace
Figure 4.1. Time series plots of ADANIPORTS
We will use the time series VWAP for the analysis below.
For all stocks, we do Augmented Dickey-Fuller tests to determine whether the time series are stationary or not.
use NIFTY_clean, clear
levelsof symbol, local(sbls)
foreach sym of local sbls {
use NIFTY_clean, clear
keep if symbol == "`sym'"
tsset date
dfuller D.vwap
}
We run the Augmented Dickey-Fuller test on the first-differenced series. All stocks report p-values at the reported minimum (effectively zero), hence we decide to use \(d=1\) for all stocks.
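First-order differencing, which the Dickey-Fuller tests above are run on, is just the lag-1 difference \(X_t - X_{t-1}\); a minimal Python sketch (the function name is our own):

```python
def difference(series, d=1):
    """Apply first-order differencing d times: X_t - X_{t-1}."""
    for _ in range(d):
        series = [x - x_prev for x_prev, x in zip(series, series[1:])]
    return series

# Toy series with a quadratic trend: differencing twice leaves a constant series.
print(difference([1, 3, 6, 10, 15]))      # first differences
print(difference([1, 3, 6, 10, 15], d=2)) # second differences
```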
Then, in order to find the AR parameter \(p\) and the MA parameter \(q\) of the model, we generate the partial autocorrelation function (PACF) plots together with the autocorrelation function (ACF) plots.
Note: we will only plot the first 8 stocks as an example.
use NIFTY_clean, clear
levelsof symbol, local(sbls)
local sbls_f8 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV " + ///
"BAJFINANCE BHARTIARTL BPCL"
foreach sym of local sbls_f8 {
use NIFTY_clean, clear
keep if symbol == "`sym'"
tsset date
ac D.vwap
graph export acf_`sym'.png
pac D.vwap
graph export pacf_`sym'.png
}
The PACF plots for the first 4 stocks are the following:
Figure 4.2. ACF and PACF plots of ADANIPORTS
And the ACF plots for the next 4 stocks are the following:
We reach a similar conclusion across stocks: lag 1 is clearly significant while lag 2 is not, hence we can choose \(p=1\) for the AR term and \(q=1\) for the MA term for all stocks.
According to the process above, we choose \(ARIMA(1, 1, 1)\) for all stocks. However, diagnostics tell us that \(ARIMA(1, 1, 0)\) sometimes performs better for some stocks. Hence, we fit whichever model is better for each stock and then plot the predicted values against the original values.
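The comparison in the loop below rests on the AIC, \(AIC = 2k - 2\log L\), where \(k\) is the number of estimated parameters and \(\log L\) is the log likelihood; the smaller value wins. A Python sketch with hypothetical log likelihoods (not the actual Stata output):

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2*logL (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical log likelihoods for the two candidate models (illustration only):
aic_111 = aic(-500.0, 4)  # ARIMA(1,1,1): ar1, ma1, constant, sigma^2
aic_110 = aic(-503.0, 3)  # ARIMA(1,1,0): ar1, constant, sigma^2
best = "ARIMA(1,1,1)" if aic_111 < aic_110 else "ARIMA(1,1,0)"
print(aic_111, aic_110, best)  # 1008.0 1012.0 ARIMA(1,1,1)
```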
Note: we will only plot the first 5 stocks as an example.
use NIFTY_clean, clear
local sbls_f5 = "ADANIPORTS ASIANPAINT AXISBANK BAJAJ-AUTO BAJAJFINSV"
foreach sym of local sbls_f5 {
use NIFTY_clean, clear
keep if symbol == "`sym'"
tsset date
arima vwap, arima(1,1,1)
estat ic
mat l_aim = r(S)
scalar aic_aim = l_aim[1,5]
arima vwap, arima(1,1,0)
estat ic
mat l_ai = r(S)
scalar aic_ai = l_ai[1,5]
if aic_aim > aic_ai {
arima vwap, arima(1,1,0)
predict vwap_pd
gen vwap_p = vwap_pd + L.vwap
graph twoway line vwap date, lwidth("vthin") lcolor("blue") ///
|| line vwap_p date, lwidth("vthin") lcolor("red") lpattern("dash")
graph export fitted_`sym'.png
}
else {
arima vwap, arima(1,1,1)
predict vwap_pd
gen vwap_p = vwap_pd + L.vwap
graph twoway line vwap date, lwidth("vthin") lcolor("blue") ///
|| line vwap_p date, lwidth("vthin") lcolor("red") lpattern("dash")
graph export fitted_`sym'.png
}
}
The sample fitted graphs are:
Now that we have chosen different models for different stocks, we could improve further by selecting the most appropriate model for each stock individually. However, Stata has no built-in equivalent of auto_arima to choose models automatically, so we may turn to the other two languages (Python and R); heavy and tedious computation would otherwise be expected in Stata.
To conclude, in this tutorial we covered applying the ARIMA model to forecast stock-related variables using Python, R and Stata. We also cross-validated our results against actual data and suggested methods for model improvement. Among the three programming languages, Python and R are very powerful for modelling time series data, as they implement the auto_arima()/auto.arima() functions, which select appropriate values for p, d and q automatically. In Stata, determination of the parameters is mostly based on inspecting ACF/PACF plots with trial and error. However, we should not blindly rely on automatic procedures; it is worthwhile to know how changes in p, d and q affect the long-term forecasts as well as the prediction intervals.